This repository was archived by the owner on Oct 27, 2025. It is now read-only.

Export documents and assets (WHIT-2419)#3311

Merged
ChrisBAshton merged 16 commits into main from export-news-articles
Sep 10, 2025
Conversation

Contributor

@ChrisBAshton ChrisBAshton commented Aug 29, 2025

What

This PR defines two rake tasks that can be used to export content out of Content Publisher:

  1. `export:live_document_and_assets`, which exports a single document, given its content ID. By default it outputs to STDOUT but will write to a JSON file if a path is provided.
  2. `export:live_documents_and_assets`, which exports all of the live documents and assets. By default it outputs to STDOUT but will write a separate JSON file for each exported document if an output directory is provided.

Usage (respectively):

# single document
rake export:live_document_and_assets["4c2ff601-4494-458b-bdc3-7d358122c2ae","./chris/foo.json"]

# all the documents
rake export:live_documents_and_assets["./chris"]
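The STDOUT-or-file behaviour shared by both tasks can be sketched roughly like this (`emit_export` is an illustrative name, not the actual code in this PR):

```ruby
require "json"

# Hedged sketch of the output behaviour described above: serialise a document
# as JSON, write to the given path when one is supplied, print to STDOUT
# otherwise. The real rake tasks wrap logic like this per document.
def emit_export(document_hash, output_path = nil)
  json = JSON.pretty_generate(document_hash)
  output_path ? File.write(output_path, json) : puts(json)
  json
end
```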

The latter rake task is what we imagine we'll use alongside an import script in Whitehall, to be built in https://gov-uk.atlassian.net/browse/WHIT-2440. Sister PR: alphagov/whitehall#10635

Why

We want to retire Content Publisher. Whilst we could 'withdraw' all content to avoid having to migrate it, we'd then need some way of hacking an update directly into Publishing API in an emergency, and we figured that would ultimately just complicate the publishing landscape even further. Better to export the content out of Content Publisher and back into Whitehall, where it can be managed centrally.

On investigation, there are only around 3,000 documents to export, of two different content types, and they're all low-value pieces (news articles that have served their purpose already and, as stated above, have already been cleared for mass withdrawal by our Content colleagues, provided we can be sure we can apply edits retrospectively). For that reason, a thorough reproduction of every bit of document history is not necessary here: we want to lift and shift the content as best we can without expending weeks' worth of effort. We're therefore only exporting the latest published edition for each document (no drafts), and are not bothering to transfer over the removed or unpublished-with-redirect items, since these would mean extra complexity for virtually no benefit.

JIRA: https://gov-uk.atlassian.net/browse/WHIT-2419


⚠️ This repo is Continuously Deployed: make sure you follow the guidance ⚠️

Follow these steps if you are doing a Rails upgrade.

@ChrisBAshton ChrisBAshton force-pushed the export-news-articles branch 2 times, most recently from 528572f to 08c347c on August 29, 2025 13:36
@ChrisBAshton ChrisBAshton changed the title from "WIP: export rake task (WHIT-2419)" to "WIP: Export documents and assets (WHIT-2419)" on Aug 29, 2025
@ChrisBAshton ChrisBAshton force-pushed the export-news-articles branch 15 times, most recently from 536a1d1 to 2699dab on September 1, 2025 09:15
@GDSNewt GDSNewt force-pushed the export-news-articles branch from b5cb257 to 55870ef on September 1, 2025 13:16
@ChrisBAshton ChrisBAshton force-pushed the export-news-articles branch 11 times, most recently from fc2c1f8 to 2bf11a8 on September 2, 2025 13:05
Content Publisher has two means of designating an edition to be
'political':

1. 'system political': if associated with role appointments or
   political organisations
2. 'editor political': if the publisher has explicitly marked the
   document as political via the "Gets history mode" checkbox in
   Content Publisher's UI.

For the purposes of importing these docs into Whitehall, we don't
particularly care about the _means_ by which a document is considered
political. We don't need the overhead of mapping one 'flavour'
of political to Whitehall's representation of that 'flavour'. It
is enough to pass the derived boolean from Content Publisher
into Whitehall (which has its own 'political' boolean override
equivalent: https://github.com/alphagov/whitehall/blob/f5afce65514b02767df8625ccfa40c681046c95e/db/schema.rb#L444).
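Collapsing the two signals into one boolean could be sketched like this (the attribute and association names are illustrative, not Content Publisher's actual schema):

```ruby
# Hedged sketch: the editor's explicit setting (the "Gets history mode"
# checkbox) overrides the system-derived value; otherwise a document is
# political if it has political associations. All names are illustrative.
def political?(edition)
  return edition.editor_political unless edition.editor_political.nil?

  edition.tagged_role_appointments.any? || edition.tagged_political_organisations.any?
end
```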
Whitehall stores change notes and editorial remarks _per edition_,
yet we're only planning to migrate the latest (published) edition
out of Content Publisher, for simplicity. So we imagine this
property will be used to form one large 'editorial remark' in
Whitehall (and potentially as one large entry in the public
changenote history too).

The TimelineEntry class is doing the heavy lifting here. I've
generally followed the same logic as _content_publisher_entry.html.erb
in deciding which properties to expose.

The way we're testing this behaviour (by layering edition upon
edition in a unit test) doesn't appear to have been done before,
as we were getting an Uninitialized Constant error on
`RevisionUpdater::Image`. Adding a `require_relative` to the
`RevisionUpdater` class worked around that.
Contributor

@lauraghiorghisor-tw lauraghiorghisor-tw left a comment


Looks okay to me. A few suggestions around redundant (or still-needed) data in comments, and plenty of musings that might be useful for the import work.

FYI - There are quite a few TODOs in the commit messages, which are actually still to do.

Having done some import work with Tony before, I would suggest going through the exercise of creating both document types in WH from the console (including image and file attachment) to better understand what you actually need and address those TODOs. That is, before merging this in. Otherwise I bet this will be followed up by a few other PRs 😅 #burned_before

Whilst it's nice and clean to keep the export WH-agnostic, I'd probably go one step further and do the required mappings here already (like CP state to WH state, and other specific fields we'd want there). What we learned from the other import work was that if you keep the export mappings un-opinionated, you just end up having to test both ends; it wasn't worth it, as this is actually an export for WH anyway.
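For illustration, the sort of opinionated export-side mapping being suggested might look like this (the state names on both sides are assumptions, not the confirmed CP or WH state machines):

```ruby
# Hedged sketch of a CP-state -> WH-state mapping baked into the export.
# Failing loudly on unmapped states keeps the export honest about its scope.
CP_TO_WHITEHALL_STATE = {
  "published"              => "published",
  "published_but_needs_2i" => "published",
  "withdrawn"              => "withdrawn",
}.freeze

def whitehall_state_for(cp_state)
  CP_TO_WHITEHALL_STATE.fetch(cp_state) do
    raise ArgumentError, "no Whitehall mapping for Content Publisher state #{cp_state.inspect}"
  end
end
```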

Whilst digging through assets I realised that plenty of those which are not strictly on the live edition are still live. Perhaps moving everything in WH takes us one step closer to being able to do a data cleanup in AM one day.

@ChrisBAshton ChrisBAshton force-pushed the export-news-articles branch 2 times, most recently from 00420e8 to 4943459 on September 4, 2025 16:56
@ChrisBAshton ChrisBAshton force-pushed the export-news-articles branch 2 times, most recently from 72e617b to ac61340 on September 8, 2025 08:03
ChrisBAshton added a commit to alphagov/whitehall that referenced this pull request Sep 9, 2025
This commit replaces `all_asset_variants_uploaded?` with a simpler
`asset_uploaded?` method for both images and attachments. Instead of
requiring all variant files (e.g. `s960`, `s300`, etc.) to be present
before allowing publishing or previewing, we now consider the asset
“ready” if the original file has successfully uploaded.

In practice, all variants are generated and uploaded within milliseconds
of the original upload, and failures nearly always stem from the original
file itself. For example, if Asset Manager has an outage or detects a virus
in the uploaded asset, then the original variant will fail to upload, and
so will the other variants. If the original variant has successfully
uploaded, then we can be highly confident (though can't _guarantee_) that
its variants will have uploaded too. That hypothetical edge case carries
minimal risk, which we can live with in exchange for the following benefits:

* Supports content with partial variant sets (we will be importing news
  articles from Content Publisher, which have only the `s960` and `s300`
  variants).
* Enables future support for dynamically generated variants (e.g. via
  Fastly image resizing) without requiring full variant pre-generation.
* Greatly simplifies the codebase and removes brittle assumptions.

See alphagov/content-publisher#3311 (comment)
for a discussion on the different variant sizes and where they're
used.

As you can see in this commit, there is a lot of room for improvement
and consolidation across Whitehall. We have several implementations
of `asset_uploaded?` that are identical across modules. In practice,
most of these implementations were already only checking for the
'original' variant, and have now been simplified to make that
absolutely explicit. We should revisit this commit later to
consider how to extract some of this behaviour out into a shared
module so that we're not duplicating identical logic in multiple
places.
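The before/after described in that commit can be sketched roughly like this (the asset shape here is illustrative; Whitehall's real models differ):

```ruby
# Hedged sketch of the simplification: previously every variant had to be
# uploaded before publishing/previewing; now only the 'original' variant
# gates readiness. Each asset is represented as a small hash for brevity.
def all_asset_variants_uploaded?(assets)
  assets.any? && assets.all? { |a| a[:uploaded] }
end

def asset_uploaded?(assets)
  original = assets.find { |a| a[:variant] == "original" }
  !original.nil? && original[:uploaded]
end
```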
GDSNewt and others added 7 commits September 9, 2025 09:43
We expect that we'll drop the `high_resolution` variant at the
point of importing into Whitehall, but have kept it here for
completeness. The existing 960-wide and 300-wide variants match
some of the allowed sizes in Whitehall, so should be importable.
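Under that assumption, the variant mapping could be as simple as the following sketch (the Content Publisher variant names are assumptions, and the 's' prefix follows Whitehall's convention):

```ruby
# Hedged sketch: keep the width-based variants, prefixed 's' on the
# Whitehall side, and silently drop 'high_resolution'.
CP_VARIANT_TO_WHITEHALL = { "960" => "s960", "300" => "s300" }.freeze

def whitehall_variants(cp_variant_names)
  cp_variant_names.filter_map { |name| CP_VARIANT_TO_WHITEHALL[name] }
end
```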

We expect the exported image can be mapped to Whitehall's modelling
like so:

```
$ Image.last
=> #<Image:0x0000ffff63777d98
 id: 796660,
 image_data_id: 258369,
 edition_id: 1691184,
 alt_text: <ALT TEXT GOES HERE>,
 caption: <CAPTION GOES HERE>,
 created_at: "2025-06-22 17:38:04.000000000 +0100",
 updated_at: "2025-06-22 17:38:07.000000000 +0100">

$ Image.last.image_data
=> #<ImageData:0x0000ffff634668c0
 id: 258369,
 carrierwave_image: "ELECTRICITY_BILLS_TO_BE_SLASHED_FOR_BUSINESSES_1__...", # TODO: is there a problem here?
 created_at: "2025-06-22 17:38:04.000000000 +0100",
 updated_at: "2025-06-22 17:38:04.000000000 +0100",
 image_kind: "default">

$ Image.last.image_data.assets
=> [#<Asset:0x0000ffff6302e7a8
  id: 3930905,
  asset_manager_id: <DERIVE THE ASSET MANAGER ID FROM THE URL AND PUT THAT HERE>,
  variant: <VARIANT GOES HERE with 's' prefix, e.g. 's300'>,
  created_at: "2025-06-22 17:38:05.267635000 +0100",
  updated_at: "2025-06-22 17:38:05.267635000 +0100",
  assetable_type: "ImageData",
  assetable_id: 258369,
  filename: <DERIVE THE FILENAME FROM THE URL AND PUT THAT HERE>,
  <Asset:0x0000ffff63d33340
  ... # and so on

  TODO: will there be an issue with the missing s216/s465/s630 variants?
```
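The two 'DERIVE ... FROM THE URL' placeholders above could be filled in with something like this, assuming the usual Asset Manager media URL shape (`https://assets.publishing.service.gov.uk/media/<id>/<filename>`):

```ruby
require "uri"

# Hedged sketch: pull the asset_manager_id and filename out of an Asset
# Manager media URL. Raises if the URL doesn't match the assumed shape.
def asset_fields_from_url(url)
  parts = URI(url).path.split("/") # e.g. ["", "media", "<id>", "<filename>"]
  unless parts.length == 4 && parts[1] == "media"
    raise ArgumentError, "unexpected asset URL shape: #{url}"
  end

  { asset_manager_id: parts[2], filename: parts[3] }
end
```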
The original implementation exposed the following properties as well:

- isbn
- unique_reference
- paper_number
- parliamentary_session
- official_document_type

...but Content Publisher only has 54 file attachments in total, and
none of them define values for any of these properties - they are
all `nil`. So there is little point in exposing them as part of
the export.

We imagine the exported attachment can be fairly easily mapped into
Whitehall's attachment modelling:

```
whitehall(prod)> Attachment.find_by(attachment_data: 1343266)
=>
<FileAttachment:0x0000ffff766fd5d0
 id: 8823005,
 created_at: "2025-08-29 16:47:37.000000000 +0100",
 updated_at: "2025-08-29 16:47:37.000000000 +0100",
 title: <TITLE GOES HERE>
 accessible: false, # As per https://gds.slack.com/archives/C03D4A72HGE/p1755694877638679 - all non-HTML attachments should have this unchecked as per guidance.
 isbn: nil,
 unique_reference: nil,
 command_paper_number: nil,
 hoc_paper_number: nil,
 unnumbered_command_paper: nil,
 unnumbered_hoc_paper: nil,
 parliamentary_session: nil,
 type: "FileAttachment",
 ...>
```

...and:

```
AttachmentData.find(1343266)
=>
<AttachmentData:0x0000ffff7671a518
 id: 1343266,
 carrierwave_file: "sample.pdf", # TODO??
 content_type: "application/pdf", # TODO??
 file_size: 131514,  # TODO??
 number_of_pages: 6, # TODO??
 created_at: "2025-08-29 16:47:37.000000000 +0100",
 updated_at: "2025-08-29 16:47:37.000000000 +0100",
 replaced_by_id: nil>
```

Some things to be worked out about the AttachmentData - but I imagine
the import script will need to `curl` the attachment and figure out
its file size, number of pages and content type.
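As a rough illustration of deriving those fields from the downloaded bytes (the page count here is a crude heuristic that counts PDF page objects; a real import would be better served by a proper parser such as the pdf-reader gem):

```ruby
# Hedged sketch: derive AttachmentData-style fields from an attachment's
# raw bytes. Content-type sniffing and page counting are deliberately
# crude and for illustration only.
def attachment_metadata(bytes, filename)
  pdf = filename.end_with?(".pdf") && bytes.start_with?("%PDF")
  {
    file_size: bytes.bytesize,
    content_type: pdf ? "application/pdf" : "application/octet-stream",
    # Counts "/Type /Page" objects, excluding the "/Pages" tree node.
    number_of_pages: pdf ? bytes.scan(%r{/Type\s*/Page[^s]}).count : nil,
  }
end
```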
I've created two rake tasks:

1. `export:live_document_and_assets`, which exports a single
   document, given its content ID. By default it outputs to STDOUT
   but will write to a JSON file if a path is provided.
2. `export:live_documents_and_assets`, which exports all of the
   live documents and assets. By default it outputs to STDOUT
   but will write a separate JSON file for each exported document
   if an output directory is provided.

Usage (respectively):

```
rake export:live_document_and_assets["4c2ff601-4494-458b-bdc3-7d358122c2ae","./chris/foo.json"]
```

```
rake export:live_documents_and_assets["./chris"]
```
We're likely to use the former in Whitehall. We won't use the
latter (there is no DB field for it).
We need to be able to associate a news article with a government
ID to properly enable history mode.
The 'document_history' property wasn't exposing any public
changenotes. We'll revisit document_history in the next commit.

I did consider rolling my own change_notes implementation:

```
  def self.change_notes(document)
    change_notes = document.editions.map do |edition|
      change_note = edition.revision.metadata_revision.change_note
      if change_note.present?
        { change_note:, created_at: edition.created_at, created_by: User.find(edition.created_by_id).email }
      end
    end
    change_notes.compact
  end
```

...but on closer inspection, there is a PublishingApiPayload
`History` class that does exactly what we want and is already
fully tested:
https://github.com/alphagov/content-publisher/blob/ef1ee481f5309e6fd010967231ff5b4723db60ca/spec/lib/publishing_api_payload/history_spec.rb#L1

It is enough for our tests to check that that class is called.

That class also handles the backdating of `first_published_at`,
so we're fixing that whilst we're here.
In the initial implementation of this, we'd wrongly assumed that
public changenotes were included in the document history displayed
to publishers in the UI. We've since defined a separate
`change_notes` method for capturing that. That new method also
includes the 'backdating' information, and we've updated
`first_published_at` to accommodate backdating too, so it's no
longer required to be surfaced here.

What is still needed is some means of exposing 'internal' change
history, i.e. internal notes / editorial remarks, but also the
option of carrying over update/publish dates of each revision and
the author of each.

So in this commit we've renamed the method to `internal_history`
to better describe the new scope, and have dropped any complexity
around backdating.
@ChrisBAshton
Contributor Author

Thanks @lauraghiorghisor-tw - I'm happy enough that the export gives us everything we need for alphagov/whitehall#10635 now. Merging 🎉

@ChrisBAshton ChrisBAshton merged commit 7a999c7 into main Sep 10, 2025
11 checks passed
@ChrisBAshton ChrisBAshton deleted the export-news-articles branch September 10, 2025 08:09